---
title: "Data Codebook"
output: github_document
---

Here is an overview of the datasets and their variable names:

# Parquet files

.parquet files are large datasets and represent the main cast vote record data.

    ## ├── by-votechoice
    ## ├── by-person-ID
    ## ├── maricopa
    ## │   └── ballots_wide_party.csv.gz
    ## ├── palmbeach
    ## │   └── herron_lewis_counts.csv.gz
    ## ├── maryland
    ## │   ├── elec=2016
    ## │   │   └── part-0.parquet
    ## │   ├── elec=2018
    ## │   │   └── part-0.parquet
    ## │   ├── elec=2020
    ## │   │   └── part-0.parquet
    ## │   └── elec=2022
    ## │       └── part-0.parquet

`by-votechoice` is stored in long form where one row is a vote choice for a particular office. 

* `elec`: South Carolina Election
* `voter_id`: Voter ID that I assigned within each election
* `Dvoter`: Proportion of votes for Democrat statewide, defined as $D$ in paper
* `top_party2`: The top of the ticket vote (President / US-Senate), defined in paper
* `top_party2_alt`: An alternative measure defined in paper Fig. B5
* `office`: Down-ballot office
* `party`: Party choice for down-ballot office. `-1`: Democrat, `1`: Republican, `0.5`: third party, `0`: other. Only available if the contest is contested by more than one major party.
* `ncand`: Number of major party candidates running in that contest
* `jID`: Contest ID
* `open`: Whether the contest is open, i.e. there is no incumbent
* `copar`: The interaction of `party` and `top_party2`, i.e. a straight vote
* `inc_copar`: The interaction of `copar` and incumbency of the candidate.


`by-person-ID` lists the contest IDs for each voter.

* `voter_id`, `elec`: Same as above
* `precinct_id`: An identifier for precinct
* `county`: County
* `USH_jID`, `HOU_jID`, `SEN_jID`, `SHF_jID`, `CCD_jID`, `JPR_jID`: Contest IDs for six offices.


`maricopa`, `maryland`, and `palmbeach` are cast vote records from the three other states I examine in the paper. These are organized in wide format, so each row represents a voter instead of a choice on the long ballot. These are analyzed in scripts starting with `10_`.
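
To make the wide/long distinction concrete, the sketch below reshapes wide rows (one voter per row, one column per office) into long rows (one row per voter-office choice), which is the layout of `by-votechoice`. The column names `pres`, `ush`, and `hou` and all values are hypothetical, not the actual columns of these files.

```python
# Hypothetical wide-format records: one row per voter, one column per office.
wide_rows = [
    {"voter_id": 1, "pres": "D", "ush": "D", "hou": "R"},
    {"voter_id": 2, "pres": "R", "ush": "R", "hou": None},  # None: no vote recorded
]


def wide_to_long(rows, id_col="voter_id"):
    """Turn each wide row into one long row per office with a recorded choice."""
    long_rows = []
    for row in rows:
        for office, choice in row.items():
            if office == id_col or choice is None:
                continue
            long_rows.append(
                {id_col: row[id_col], "office": office, "choice": choice}
            )
    return long_rows


long_rows = wide_to_long(wide_rows)
for r in long_rows:
    print(r)
```

The two voters yield five long rows, because the second voter has no recorded choice for the hypothetical `hou` column.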


# CSV files

.csv files are typically small datasets with metadata and summary statistics specific to a particular script. 

    ## ├── by-HOU-USH-dist_split.csv
    ## ├── by-contest_cand-metadata.csv
    ## ├── deluca_quality.csv
    ## ├── hist-elecs_by-office.csv
    
`by-contest_cand-metadata.csv` is the most frequently used file and holds candidate-level metadata for South Carolina. Sources: the South Carolina Election Commission, my own data collection, and DIME (version 3.1). Limited to D vs. R contests. Used in almost all scripts.

* `elec`, `office`, `jID`, `county`, `open`: Same as above but for candidates
* `dist`: District number if applicable.
* `n_dr`: Number of D and R candidates.
* `row_id_R`, `row_id_D`: Row ID for candidates 
* `cand_R`,  `cand_D`: Names of candidates
* `incumbency_R`, `incumbency_D`: Incumbency indicator for candidates
* `hits_R`, `hits_D`: Number of newspaper mentions in term, described in main text.
* `money_R`, `money_D`: Dollars raised, described in main text
* `Rmoneyadv`, `Rnewsadv`: Log ratios of money and hits, respectively, described in section A5.
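
A log ratio of this kind can be sketched as follows. This is an illustration only: the direction (Republican over Democrat, suggested by the `R...adv` names), the use of the natural log, and the lack of any adjustment for zeros are all assumptions. Section A5 of the paper gives the actual definition.

```python
import math


def log_ratio_advantage(value_r, value_d):
    """Log of the Republican value over the Democratic value (assumed direction).

    Positive when the Republican has more, negative when the Democrat has
    more, and zero when the two are equal. Requires both values to be positive.
    """
    if value_r <= 0 or value_d <= 0:
        raise ValueError("log ratio is undefined for non-positive values")
    return math.log(value_r / value_d)


# Hypothetical dollars raised by the two candidates in one contest.
money_adv = log_ratio_advantage(200_000, 100_000)
print(round(money_adv, 3))  # 0.693
```

One convenient property of the log scale is symmetry: swapping the two candidates flips the sign but keeps the magnitude.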

`deluca_quality.csv` is a small dataset from DeLuca, described in main text. Used in Figure 5.

* `state`, `elec`, `office`, `dist`: Same as above
* `top_office`: Reference office used to define ticket splitting
* `split_for_D`, `split_for_R`: Split-ticket rates from the CVR
* `straight_for_D`, `straight_for_R`: Straight-ticket rates from the CVR
* `quality_differential`: DeLuca's main measure of quality differential

`by-HOU-USH-dist_split.csv` is a small file estimating the split ticket rate for State House. 

* `elec`, `year`: Election year
* `USH_dist`, `HOU_dist`: Combination of US House and State House district IDs
* `hstraight`: Proportion of voters voting for the same party in the two contests
* `n`: Number of voters
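
As a rough sketch of how `hstraight` and `n` relate to voter-level data, the code below groups voters by district pair and computes the share who chose the same party in both contests. The input fields `ush_party` and `hou_party` and the records themselves are hypothetical, and this is not necessarily how the file was produced.

```python
from collections import defaultdict

# Hypothetical voter-level records: district pair plus the party chosen
# in the US House and State House contests.
voters = [
    {"USH_dist": 1, "HOU_dist": 10, "ush_party": "R", "hou_party": "R"},
    {"USH_dist": 1, "HOU_dist": 10, "ush_party": "R", "hou_party": "D"},
    {"USH_dist": 1, "HOU_dist": 10, "ush_party": "D", "hou_party": "D"},
    {"USH_dist": 1, "HOU_dist": 11, "ush_party": "D", "hou_party": "D"},
]


def straight_share_by_district(rows):
    """Return {(USH_dist, HOU_dist): {"hstraight": share, "n": count}}."""
    counts = defaultdict(lambda: {"same": 0, "n": 0})
    for row in rows:
        key = (row["USH_dist"], row["HOU_dist"])
        counts[key]["n"] += 1
        if row["ush_party"] == row["hou_party"]:
            counts[key]["same"] += 1
    return {
        key: {"hstraight": c["same"] / c["n"], "n": c["n"]}
        for key, c in counts.items()
    }


summary = straight_share_by_district(voters)
print(summary[(1, 10)])
```

In this toy input, two of the three voters in district pair (1, 10) vote for the same party in both contests, so `hstraight` is 2/3 with `n` equal to 3.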


`hist-elecs_by-office.csv` is a dataset of historical elections, used in Appendix A6. Data collected by David Lublin and Carl Klarner, supplemented by my own collection. The table aggregates the contest-level dataset into year bins.

* `office`: Office examined
* `yr_bin`: The year range covered
* `n`: Number of contests
* `pct_R`, `pct_D`: Percent won by Republican, Democrat
* `mar_R`: Win margin of Republican
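
The binning step can be sketched as below: contests are grouped by office and year bin, and each bin reports the number of contests and the percent won by each party. The 10-year bin width, the bin label format, the input fields, and the reading of `pct_R` and `pct_D` as shares of contests won are assumptions for illustration. `mar_R` is left out because its exact definition is not given here.

```python
from collections import defaultdict

# Hypothetical contest-level records.
contests = [
    {"office": "State House", "year": 1992, "winner": "D"},
    {"office": "State House", "year": 1996, "winner": "R"},
    {"office": "State House", "year": 1998, "winner": "D"},
    {"office": "State House", "year": 2004, "winner": "R"},
]


def year_bin(year, width=10):
    """Label the bin containing `year`, e.g. 1996 -> '1990-1999' for width 10."""
    start = (year // width) * width
    return f"{start}-{start + width - 1}"


def bin_contests(rows, width=10):
    """Count contests and party win percentages per (office, yr_bin)."""
    wins = defaultdict(lambda: {"n": 0, "R": 0, "D": 0})
    for row in rows:
        key = (row["office"], year_bin(row["year"], width))
        wins[key]["n"] += 1
        if row["winner"] in ("R", "D"):
            wins[key][row["winner"]] += 1
    return {
        key: {
            "n": w["n"],
            "pct_R": 100 * w["R"] / w["n"],
            "pct_D": 100 * w["D"] / w["n"],
        }
        for key, w in wins.items()
    }


binned = bin_contests(contests)
print(binned[("State House", "1990-1999")])
```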


# Stata dta files

.dta files hold survey data for one-off uses. Sources: ANES and CCES.

    ## ├── hist-svy_anes.dta
    ## ├── hist-svy_cces.dta
    ## ├── hist-svy_cd-2020.dta
    ## ├── hist-svy_cd.dta
    

`hist-svy_anes.dta` is an extract from the ANES cumulative file, coded similarly to the replication data of Jacobson (2015, JOP). Used in Figure 1.

* `year`: Year, VCF0004 in ANES
* `weight`: Survey weight, VCF0009x in ANES
* `id`: Respondent ID, VCF0006a in ANES
* `state`: State, VCF0901b in ANES
* `cd`: Congressional district, combination of state and VCF0900c
* `hvote`: House vote, 1 for Democrat and 0 for Republican, VCF0707 in ANES
* `pvote`: Presidential vote, coded the same way, VCF0704 in ANES
* `straight`: Interaction of `hvote` and `pvote`, straight ticket
* `VCF0704`: Original Presidential vote coding
* `VCF0707`: Original House vote coding

`hist-svy_cces.dta` is an extract from the CCES cumulative file, <https://doi.org/10.7910/DVN/II2DB6>. Figure 1.

* `year`, `case_id`, `weight`, `cd`: Year, respondent ID, weight, and district, as described in CCES cumulative codebook.
* `pres_v`, `pres_i`: Presidential vote for post-election vote (v) and pre-election intent (i)
* `rep_v`, `rep_i`: Same but for US House
* `straight`, `split`: Interaction between `pres_v` and `pres_i`


`hist-svy_cd.dta` contains CD-level information on whether each district was contested in each election. I limit the analysis to contested districts. Used in Figure 1.

* `year`, `st`, `cd`: District identifiers
* `contes`: Whether the district is contested by both a D and an R candidate. My coding, with Jacobson's coding for earlier years.

`hist-svy_cd-2020.dta` is a single-year file indicating the Presidential vote in each district. Used in Figure B7 as a robustness check. `pct_trump20` is the vote share for Trump in 2020, and `pct_trump16` is the vote share for Trump in 2016.


# Rds files

.rds files are cluster objects estimated with the clusterCVR package described in the main text.

    ## ├── clusters
    ## │   ├── by-K
    ## │   │   ├── D12_list.rds
    ## │   │   ├── D16_list.rds
    ## │   │   ├── R12_list.rds
    ## │   │   ├── R16_list.rds
    ## │   │   ├── p12_list.rds
    ## │   │   └── p16_list.rds
    ## │   ├── p12_D-subset_k4.rds
    ## │   ├── p12_R-subset_k4.rds
    ## │   ├── p16_D-subset_k4.rds
    ## │   └── p16_R-subset_k4.rds
    
Each file with `subset_k4` in its name is an output of the clusterCVR package, fit on the same 2012 and 2016 parquet data as above. Used in Figure 3. See the documentation of the package at <https://github.com/kuriwaki/clustercvr> for details on the output. The main parameters are:

* `pi`: estimates of bloc sizes
* `mu`: array of vote choice for each office in each bloc (cluster)
* `loglik`: The log likelihood fit of the final iterations

`D` indicates Obama (2012) or Clinton (2016) voters, and `R` indicates Romney (2012) or Trump (2016) voters. These are subsets of the data.

Each `_list` file follows the naming convention of the other files but shows summary statistics of the log likelihood fit for a series of cluster counts, K = 2 to 10. Each item in the list corresponds to one value of K. Used in Appendix C, Figure C1.
